Entropy-Based Cluster Validation and Estimation of the Number of Clusters in Gene Expression Data

Authors

  • Natalia Novoselova
  • Igor Tom
Abstract

Many external and internal validity measures have been proposed for estimating the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Building on an approach that assesses the predictive power, or stability, of a partitioning, we propose a new cluster validation measure and a selection procedure for determining a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme (consensus clustering). In the proposed selection procedure, a stable clustering result is identified with reference to the validity measure under the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to evaluate clustering results on several datasets. The proposed procedure produced accurate and robust estimates of the number of clusters, which agree with biological knowledge and gold standards of cluster quality.
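The procedure outlined in the abstract can be illustrated with a rough sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: a plain k-means as the base clusterer, the mean binary entropy of the consensus-matrix entries as the "clearness" score, column-wise permutation of the data as the no-cluster null, and the largest gap between the null and data scores as the selection rule. All function names are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means (assumed base clusterer); one label per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def consensus_matrix(X, k, rng, n_resamples=30, frac=0.8):
    """Fraction of resamples in which each pair of rows is co-clustered."""
    n = len(X)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # Pairs never drawn together keep a consensus value of 0.
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

def consensus_entropy(M):
    """Mean binary entropy of off-diagonal entries (assumed clearness score).
    Near 0 for a crisp 0/1 matrix, 1 when every entry equals 0.5."""
    p = np.clip(M[np.triu_indices_from(M, k=1)], 1e-12, 1 - 1e-12)
    return float(np.mean(-(p * np.log2(p) + (1 - p) * np.log2(1 - p))))

def permuted(X, rng):
    """Assumed null model: shuffle each column independently."""
    return np.column_stack([rng.permutation(col) for col in X.T])

def estimate_k(X, k_values, seed=0):
    """Pick the k with the largest gap between null and data entropy."""
    rng = np.random.default_rng(seed)
    X_null = permuted(X, rng)
    gaps = {}
    for k in k_values:
        h_data = consensus_entropy(consensus_matrix(X, k, rng))
        h_null = consensus_entropy(consensus_matrix(X_null, k, rng))
        gaps[k] = h_null - h_data
    return max(gaps, key=gaps.get), gaps
```

Calling `estimate_k(X, range(2, 8))` on a samples-by-features array returns the selected number of clusters together with the per-k gaps, which can be plotted as a stand-in for the validity plots mentioned above. A real analysis would likely use more resamples and several permuted data sets rather than one.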


Similar resources

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...


Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...


Continuous Zoning of Soil Electrical Conductivity and Acidity Based on Fuzzy Clustering for the Qom Plain

Electrical conductivity and acidity of soil are the most important chemical factors of soil for agriculture. Soil properties vary continuously, so a method that accounts for this continuity gives a better picture of the change in soil characteristics. The objectives of this research are to investigate the relations between measured electric...


Estimation of geochemical elements using a hybrid neural network-Gustafson-Kessel algorithm

Bearing in mind that lack of data is a common problem in the study of porphyry copper mining exploration, our goal was set to identify the hidden patterns within the data and to extend the information to the data-less areas. To do this, the combination of pattern recognition techniques has been used. In this work, multi-layer neural network was used to estimate the concentration of geochemical ...


Target Detection Improvements in Hyperspectral Images by Adjusting Band Weights and Identifying end-members in Feature Space Clusters

Spectral target detection can be regarded as one of the strategic applications of hyperspectral data analysis. The presence of targets in an area smaller than a pixel's ground coverage has led to the development of spectral un-mixing methods to detect these types of targets. Usually, spectral un-mixing algorithms assume similar weights for the spectral bands. Howe...



Journal title:
  • Journal of Bioinformatics and Computational Biology

Volume 10, Issue 5


Publication date: 2012